Abstract: Achieving high performance in many distributed
systems requires finding a good assignment of
threads to servers as well as effectively allocating each 
server’s resources to its assigned threads. The assignment 
and allocation components of this problem have been 
studied largely separately in the literature. In this paper, 
we introduce the assign and allocate (AA) problem, which 
seeks to simultaneously find an assignment and allocations 
that maximize the total utility of the threads. Assigning and 
allocating the threads together can result in substantially 
better overall utility than performing the steps separately, 
as is traditionally done. We model each thread by a concave 
utility function giving its throughput as a function of its 
assigned resources. We first show that the AA problem 
is NP-hard, even when there are only two servers. We 
then present a $2(\sqrt{2} - 1) > 0.828$ factor approximation
algorithm, which runs in $O(mn^2 + n(\log mC)^2)$ time for
n threads and m servers with C amount of resources
each. We also present a faster algorithm with the same
approximation ratio and $O(n(\log mC)^2)$ running time.
We conducted experiments to test the performance of 
our algorithm on threads with different types of utility 
functions, and found that it achieves over 99% of the 
optimal utility on average. We also compared our algorithm 
against several other assignment and allocation algorithms, 
and found that it achieves up to 5.7 times better total utility. 

Keywords: Resource allocation; algorithms; multicores; cloud

I. Introduction

In this paper, we study efficient ways to run a set 
of threads on multiple servers. Our problem consists 
of two steps. First, each thread is assigned to a server. 
Subsequently, the resources at each server are allocated 
to the threads assigned to it. Each thread obtains a certain 
utility based on the resources it obtains. The goal is to 
maximize, over all possible assignments and allocations, 
the total utility of all the threads. We call this problem 
AA, for assign and allocate. 

While our terminology generically refers to threads 
and servers, the AA problem can model a range of distributed
systems. For example, in a multicore processor,
each core corresponds to a server offering its shared
cache as a resource to concurrently executing threads.
Each thread is first bound to a core, after which cache
partitioning [?], [?] can enforce an allocation of the
core’s cache among the assigned threads. Each thread’s 
performance is a (typically concave) function of the size 
of its cache partition. As the threads’ cache requirements 
differ, achieving high overall utility requires both a 
proper mapping of threads to the cores and properly 
partitioning each core’s cache. Another application of 
AA is a web hosting center in which a number of web 
service threads run on multiple servers and compete 
for resources such as processing or memory. The host 
seeks to maximize system utility to run a larger number 
of services and obtain greater revenue. Lastly, consider 
a cloud computing setting in which a provider sells 
virtual machine instances (threads) running on physical 
machines (servers). Customers express their willingness 
to pay for instances consuming different amounts of 
resources using utility functions, and the provider’s task 
is to assign and size the virtual machines to maximize 
her profit.

The two steps in AA correspond to the thread assignment
and resource allocation problems, both of which
have been studied extensively in the literature. However,
to our knowledge, these problems have not been studied
together in the unified context considered in this paper.
Existing works on resource allocation [?], [?], [?], [?], [?]
largely deal with dividing the resources on a single
server among a given set of threads. It is not clear how to 
apply these algorithms when there are multiple servers, 
since there are many possible ways to initially assign 
the threads to servers, and certain assignments result in 
low overall performance regardless of how resources are 
subsequently allocated. For example, if there are two
types of threads, one with high maximum utility and 
one with low utility, then assigning all the high utility 
threads to the same server will result in competition 
between them and low overall utility no matter how 
resources are allocated. Likewise, existing works on 
thread assignment [?], [?], [?], [?] often overlook the
resource allocation aspect. Typically each thread requests
a fixed amount of resource, and once assigned to a
server it is allocated precisely the resource it requested,
without any adjustments based on the requests of the
other threads assigned to the same server. This can also
lead to suboptimal performance. For example, consider
a thread that obtains $x^\beta$ utility when allocated $x$ amount
of resource, for some $\beta \in (0, 1)$, and suppose the thread
requests $z$ resources, for an arbitrary $z$. Then when there
are $n$ threads and one server with $C$ resources, $C/z$ of the
threads would receive $z$ resources each while the rest
receive 0, leading to a total utility of $C z^{\beta - 1}$, which is
constant in $n$. However, the optimal allocation gives $C/n$
resources to each thread and has total utility $C^\beta n^{1 - \beta}$,
which is arbitrarily better for large $n$.

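The gap in this example can be illustrated numerically. The following is a minimal sketch, assuming the utility $x^\beta$ and the two allocation policies described above; the function names and the sample values of C, z and $\beta$ are hypothetical choices made only for illustration.

```python
def fixed_request_utility(n: int, C: float, z: float, beta: float) -> float:
    """Total utility when every served thread receives exactly its request z."""
    served = min(n, int(C // z))  # only C/z threads fit on the server
    return served * z ** beta


def equal_split_utility(n: int, C: float, beta: float) -> float:
    """Total utility when the C resources are split equally among n threads."""
    return n * (C / n) ** beta


if __name__ == "__main__":
    C, z, beta = 100.0, 10.0, 0.5  # illustrative values, not from the paper
    for n in (10, 100, 1000, 10000):
        print(n, fixed_request_utility(n, C, z, beta), equal_split_utility(n, C, beta))
```

Under these assumptions the first column of utilities stays fixed once n exceeds C/z, while the second grows as $n^{1-\beta}$.
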
The AA problem combines thread assignment and 
resource allocation. We model each thread using a nondecreasing
and concave utility function giving its performance
as a function of the resources it receives. The goal
is to simultaneously find assignments and allocations for
all the threads that maximize their total utility. We first
show that the AA problem is NP-hard, even when there
are only two servers. In contrast, the problem is efficiently
solvable when there is a single server [?]. Next,
we present an approximation algorithm that finds a solution
with utility at least $\alpha = 2(\sqrt{2} - 1) > 0.828$ times
the optimal. The algorithm relates the optimal solution
of a single server problem to an approximately optimal
solution of the multiple server problem. The algorithm
runs in time $O(mn^2 + n(\log mC)^2)$, where n and m are
the number of threads and servers, respectively, and C is
the amount of resource on each server. We then present
a faster algorithm with $O(n(\log mC)^2)$ running time
and the same approximation ratio. Finally, we conduct
experiments to test the performance of our algorithm. 
We used several types of synthetic threads with different 
utility functions, and show that our algorithm obtains at 
least 99% of the maximum utility on average for all 
the thread types. We also compared our algorithm with 
several simple but practical assignment and allocation 
heuristics, and show that our algorithm performs up to 
5.7 times better when the threads have widely differing 
utility functions. 

The rest of the paper is organized as follows. In Section
II, we describe related works on thread assignment and
resource allocation. Section III formally defines our
model and the AA problem. Section IV proves AA is NP-hard.
Section V presents an approximation algorithm and
its analysis, and Section VI proposes a faster algorithm. In
Section VII, we describe our experimental results. Finally,
Section VIII concludes.

II. Related Works 

There is a large body of work on resource allocation
for a single server. Fox et al. [?] considered concave
utility functions and proposed a greedy algorithm to find
an optimal allocation in $O(nC)$ time, where n is the
number of threads and C is the amount of resource on
the server. Galil [?] proposed an improved algorithm
with $O(n(\log C)^2)$ running time, by doing a binary
search to find an allocation in which the derivatives
of all the threads’ utility functions are equal, and the
total resources used by the allocation is C. Resource
allocation for nonconcave utility functions is weakly NP-complete.
However, Lai and Fan [?] identified a structural
property of real-world utility functions that leads
to fast parameterized algorithms. In a multiprocessor
setting, utility typically corresponds to cache miss rate.
Miss rate curves can be determined by running threads
multiple times using different cache allocations. Qureshi
et al. [?] proposed efficient hardware based methods
to minimize the overhead of this process, and also
designed algorithms to partition a shared cache between
multiple threads using methods based on [?]. Zhao et
al. [?] proposed software based methods to measure
page misses and allocate memory to virtual machines in
a multicore system. Chase et al. used utility functions
to dynamically determine resource allocations among
users in hosting centers and maximize total profit.

* Footnote to the $x^\beta$ example in Section I: for $\beta = 1/2$ this is known as the “square root rule” [?], [?].

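The equal-derivative idea mentioned above can be sketched for one concrete family of utilities. The code below is only an illustration and is not the cited algorithm: it assumes utilities of the hypothetical form $f_i(x) = w_i \sqrt{x}$ with positive weights, for which the demand at a common marginal utility has a closed form, and it bisects on that common marginal utility until the total demand matches C. The function name and parameters are assumptions.

```python
import math


def allocate_equal_marginals(weights, C, iters=100):
    """Split C among threads with utilities f_i(x) = w_i * sqrt(x), w_i > 0.

    Bisects on a common marginal utility lam. Thread i's demand at lam solves
    f_i'(x) = lam, i.e. x = (w_i / (2 * lam)) ** 2, capped at C.
    """
    def demand(lam):
        return [min(C, (w / (2.0 * lam)) ** 2) for w in weights]

    lo = 1e-12                                        # demand is at its cap here
    hi = math.sqrt(sum(w * w for w in weights) / C)   # total demand is C/4 here
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if sum(demand(mid)) > C:
            lo = mid   # marginal utility too low: threads ask for too much
        else:
            hi = mid
    return demand(hi)


if __name__ == "__main__":
    print(allocate_equal_marginals([1.0, 2.0, 3.0], 14.0))
```

For this family the result can be cross-checked against the closed form $x_i = C w_i^2 / \sum_j w_j^2$.
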
There has also been extensive work on assigning 
threads to cores in multicore architectures. Becchi et
al. [?] proposed a scheme to determine an optimized
assignment of threads to heterogeneous cores. They
characterized the behavior of each thread on a core by
a single value, its IPC (instructions per clock), which
is independent of the amount of resource the thread
uses. In contrast, we model a thread using a utility
function giving its performance for different resource
allocations. Radojkovic et al. [?] proposed a statistical
approach to assign threads on massively multithreaded
processors, choosing the best assignment out of a large
random sample. The results in [?] and [?] are based
on simulations and provide no theoretical guarantees.
A problem related to thread assignment is application
placement, in which applications with different resource
requirements need to be mapped to servers while fulfilling
certain quality of service guarantees. Urgaonkar et al.
[?] proposed offline and online approximation algorithms
for application placement. The offline algorithm achieves 
a i approximation ratio. However, these algorithms are 
not directly comparable to ours, as they consider multiple 
types of resources while we consider one type. 

Co-scheduling is a technique which divides a set of 
threads into subsets and executes each subset together 
on one chip of a multicore processor. It is frequently 
used to minimize cache interference between threads. 
Jiang et al. [?] proposed algorithms to find optimal co-schedules
for pairs of threads, and also approximation
algorithms for co-scheduling larger groups of threads.
Tian et al. [?] also proposed exact and approximation
algorithms for co-scheduling on chip multiprocessors
with a shared last level cache. Zhuravlev et al. [?]
surveyed scheduling techniques for sharing resources in
multicores. The works on co-scheduling require measuring
the performance from running different groups of
threads together. When co-scheduling large groups, the 
number of measurements required becomes prohibitive. 
In contrast, we model threads by utility functions, which 
can be determined by measuring the performance of 
individual threads instead of groups. 

The AA problem is related to the multiple knapsack 
and also multiple-choice knapsack (MCKP) problems. 
There are many works on both problems. For the former,
Neebe et al. [20] proposed a branch-and-bound
algorithm, and Chekuri et al. [?] proposed a PTAS.
The multiple knapsack problem differs from AA in that
each item, corresponding to a thread, has a single weight
and value, corresponding to a single resource allocation
and associated throughput; in contrast, we use utility
functions which allow threads a continuous range of
allocations and throughputs. The MCKP problem can
model utility functions as it considers classes of items
with different weights and values and chooses one item
from each class; the class corresponds to a utility function.
However, MCKP only considers a single knapsack,
and thus corresponds to a restricted form of AA with
one server. Kellerer [?] proposed a greedy MCKP
algorithm. Lawler [?] proposed a $1 - \epsilon$ approximate
algorithm, while Gens and Levner m proposed a | 
approximate algorithm with better running time. AA can 
be seen as a combined multiple-choice multiple knapsack 
problem. We are not aware of any previous work on this 
problem. This paper is a first step for the case where the 
ratios of values to weights in each item class is concave 
and there are items for every weight. 

III. Model 

In this section we formally define our model and 
problem. We consider a set of m homogeneous servers 
$s_1, \dots, s_m$, where each server has $C > 0$ amount of
resources. The homogeneity assumption is reasonable in
a number of settings. For example, multicore processors
typically contain shared caches of the same size, and datacenters
often have many identically configured servers
for ease of management. Homogeneous servers have also
been widely studied in the literature [?], [?]. We also
have n threads $t_1, \dots, t_n$. We imagine the threads are
performing long-running tasks, and so the set of threads
is static. Let S and T denote the set of servers and
threads, respectively. Each thread $t_i$ is characterized by a
utility function $f_i : [0, C] \to \mathbb{R}^{\ge 0}$, giving its performance
as a function of the resources it receives. We assume that
$f_i$ is nonnegative, nondecreasing and concave. The concavity
assumption models a diminishing returns property
frequently observed in practice [?]. While it does not
apply in all settings, concavity is a common assumption
used in many utility models, especially on cache and
memory performance [?], [?].

Our goal is to assign the threads to the servers in a 
way that respects the resource bounds and maximizes 
the total utility. While a solution to this problem involves
both an assignment of threads and allocations of
resources, for simplicity we use the term assignment to
refer to both. Thus, an assignment is given by a vector
$[(r_1, c_1), \dots, (r_n, c_n)]$, indicating that each thread $t_i$ is
allocated $c_i$ amount of resource on server $s_{r_i}$. Let $S_j$
be the set of threads assigned to server $s_j$. That is,
$S_j = \{i \mid r_i = j\}$. Then for all $1 \le j \le m$, we require
$\sum_{i \in S_j} c_i \le C$, so that the threads assigned to $s_j$ use
at most C resources. We assume that every thread is
assigned to some server, even if it receives 0 resources
on the server. The total utility from an assignment is
$\sum_{i \in T} f_i(c_i)$. The AA (assign
and allocate) problem is to find an assignment that
maximizes the total utility.

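The model above can be written down directly. The following is a minimal sketch, assuming an assignment is stored as a list of (server index, allocation) pairs with servers numbered from 0; the names `is_valid` and `total_utility` and the sample utilities are hypothetical.

```python
import math
from typing import Callable, List, Tuple

# One (r_i, c_i) pair per thread: server index and allocated resource.
Assignment = List[Tuple[int, float]]


def is_valid(assignment: Assignment, m: int, C: float, tol: float = 1e-9) -> bool:
    """Check that every server index is in range and no server exceeds C."""
    used = [0.0] * m
    for r, c in assignment:
        if not (0 <= r < m) or c < 0:
            return False
        used[r] += c
    return all(u <= C + tol for u in used)


def total_utility(assignment: Assignment,
                  utilities: List[Callable[[float], float]]) -> float:
    """Sum of f_i(c_i) over all threads."""
    return sum(f(c) for (_, c), f in zip(assignment, utilities))


if __name__ == "__main__":
    fs = [math.sqrt, lambda x: 2.0 * x]   # illustrative concave utilities
    a = [(0, 0.25), (1, 1.0)]
    print(is_valid(a, m=2, C=1.0), total_utility(a, fs))
```
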

IV. Hardness of the Problem


In this section, we show that it is NP-hard to find an 
assignment maximizing the total utility, even when there 
are only two servers. Thus, it is unlikely there exists an 
efficient optimal algorithm for the AA problem. This 
motivates the approximation algorithms we present in 
Sections V and VI.


Theorem IV.1. Finding an optimal AA assignment is
NP-hard, even when there are only two servers. 


Proof: We give a reduction from the NP-hard
partition problem [22] to the AA problem with two
servers. In the partition problem, we are given a set of
numbers $S = \{c_1, \dots, c_n\}$ and need to determine if there
exists a partition of S into sets $S_1$ and $S_2$ such that
$\sum_{i \in S_1} c_i = \sum_{i \in S_2} c_i$. Given an instance of partition,
we create an AA instance A with two servers each with
$C = \frac{1}{2} \sum_{i=1}^{n} c_i$ amount of resources. There are n threads
$t_1, \dots, t_n$, where the i’th thread has utility function $f_i$
defined by

$$f_i(x) = \begin{cases} x & \text{if } x \le c_i \\ c_i & \text{otherwise} \end{cases}$$

The $f_i$ functions are nondecreasing and concave. We
claim the partition instance has a solution if and only if
A’s maximum utility is $\sum_{i=1}^{n} c_i$. For the if direction, let
$A^* = [(r_1^*, c_1^*), \dots, (r_n^*, c_n^*)]$ denote an optimal solution
for A, and let $S_1$ and $S_2$ be the set of threads assigned
to the servers 1 and 2, respectively. We show that
$S_1, S_2$ solve the partition problem. We first show that
$c_i^* = c_i$ for all i. Indeed, if $c_i^* < c_i$ for some i,
then $f_i(c_i^*) < c_i$, while $f_j(c_j^*) \le c_j$ for all $j \ne i$.
Thus, $\sum_{i=1}^{n} f_i(c_i^*) < \sum_{i=1}^{n} c_i$, which contradicts the
assumption that $A^*$’s utility is $\sum_{i=1}^{n} c_i$. Next, suppose
$c_i^* > c_i$ for some i. Then since $A^*$ is a valid assignment,
we have $\sum_{i \in S_1} c_i^* + \sum_{i \in S_2} c_i^* \le C + C = \sum_{i=1}^{n} c_i$,
and so there exists $j \ne i$ such that $c_j^* < c_j$. But
then $f_j(c_j^*) < c_j$ and $f_k(c_k^*) \le c_k$ for all $k \ne j$, so
$A^*$’s total utility is $\sum_{i=1}^{n} f_i(c_i^*) < \sum_{i=1}^{n} c_i$, again a
contradiction. Thus, we have $c_i^* = c_i$ for all i, and
so $\sum_{i \in S_1} c_i^* + \sum_{i \in S_2} c_i^* = \sum_{i=1}^{n} c_i = 2C$. So, since
$\sum_{i \in S_1} c_i^* \le C$ and $\sum_{i \in S_2} c_i^* \le C$, then $\sum_{i \in S_1} c_i^* =
\sum_{i \in S_2} c_i^* = \sum_{i \in S_1} c_i = \sum_{i \in S_2} c_i = C$, and
$S_1, S_2$ solve the partition problem.

For the only if direction, suppose $S_1, S_2$ are a solution
to the partition instance. Then since $\sum_{i \in S_1} c_i =
\sum_{i \in S_2} c_i = C$, we can assign the threads with indices in
$S_1$ and $S_2$ to servers 1 and 2, respectively, and get a valid
assignment with utility $\sum_{i=1}^{n} f_i(c_i) = \sum_{i=1}^{n} c_i$. This is
a maximum utility assignment for A, since $f_i(x) \le c_i$
for all i. Thus, the partition problem reduces to the AA
problem, and so the latter is NP-hard for two servers. ■

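The reduction can be made concrete on small inputs. The sketch below builds the two-server AA instance from a list of numbers and then, by brute force over all $2^n$ server choices, checks whether an assignment with utility $\sum_i c_i$ exists. It relies on the observation from the proof that such an assignment must give every thread exactly $c_i$. The function names are hypothetical and the exponential search is for illustration only.

```python
from itertools import product


def aa_instance_from_partition(numbers):
    """Return (C, utilities) for the two-server AA instance of the reduction."""
    C = sum(numbers) / 2.0
    # f_i(x) = x if x <= c_i, and c_i otherwise.
    utilities = [lambda x, c=c: min(x, c) for c in numbers]
    return C, utilities


def has_full_utility_assignment(numbers):
    """Brute force: can every thread i get c_i without overloading a server?"""
    C, _ = aa_instance_from_partition(numbers)
    for servers in product((0, 1), repeat=len(numbers)):
        load = [0.0, 0.0]
        for r, c in zip(servers, numbers):
            load[r] += c
        if load[0] <= C and load[1] <= C:
            return True
    return False


if __name__ == "__main__":
    print(has_full_utility_assignment([3, 1, 1, 2, 2, 1]))  # partitionable
    print(has_full_utility_assignment([2, 2, 3]))           # not partitionable
```
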
V. Approximation Algorithm 

In this section, we present an algorithm to find an 
assignment with total utility at least $\alpha = 2(\sqrt{2} - 1) > 0.828$
times the optimal in $O(mn^2 + n(\log mC)^2)$ time.
The algorithm consists of two main steps. The first
step transforms the utility functions, which are arbitrary
nondecreasing concave functions and hence difficult to
work with algorithmically, into functions consisting of
two linear segments, which are easier to handle. Next, we
find an $\alpha$-approximate thread assignment for the
linearized functions. We then show that this leads to an $\alpha$
approximate solution for the original concave problem.

A. Linearization 

To describe the linearization procedure, we start with 
the following definition. 

Definition V.1. Given an AA problem A with m servers
each with C amount of resources, and n threads with
utility functions $f_1, \dots, f_n$, a super-optimal allocation
for A is a set of values $\hat{c}_1, \dots, \hat{c}_n$ that maximizes the
quantity $\sum_{i=1}^{n} f_i(\hat{c}_i)$, subject to $\sum_{i=1}^{n} \hat{c}_i \le mC$. Call
$\sum_{i=1}^{n} f_i(\hat{c}_i)$ the super-optimal utility for A.

To motivate the above definition, note that for any
valid assignment $[(r_1, c_1), \dots, (r_n, c_n)]$ for A, we have
$\sum_{i=1}^{n} c_i \le mC$, and so the utility of the assignment
$\sum_{i=1}^{n} f_i(c_i)$ is at most the super-optimal utility of A.
Let $F^*$ and $\hat{F}$ denote A’s maximum utility and super-optimal
utility, respectively. Then we have the following.

Lemma V.2. $F^* \le \hat{F}$.

Thus, to find an $\alpha$ approximate solution to A, it
suffices to find an assignment with total utility at least
$\alpha \hat{F}$. We note that the problem of finding $\hat{F}$ and the
associated super-optimal allocation can be solved in
$O(n(\log mC)^2)$ time using the algorithm from [?],
since the $f_i$ functions are concave. Also, since these
functions are nondecreasing, we have the following.

Lemma V.3. $\sum_{i=1}^{n} \hat{c}_i = mC$.

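For experimentation, a super-optimal allocation can be approximated on a discrete grid. The sketch below is not the $O(n(\log mC)^2)$ algorithm referred to above; it is a simple marginal-gain greedy over units of size `step`, which is optimal on that grid when the utilities are concave. It assumes each thread is capped at C (the domain of its utility function), and the function name and the `step` parameter are hypothetical.

```python
import heapq


def super_optimal_allocation(utilities, m, C, step=1.0):
    """Greedy unit-by-unit allocation of the pooled m * C resources.

    Repeatedly grants one unit of size `step` to the thread whose next unit
    has the largest marginal gain, with each thread capped at C.
    """
    n = len(utilities)
    units_per_server = int(round(C / step))
    alloc_units = [0] * n
    # heapq is a min-heap, so store negated marginal gains.
    heap = [(-(f(step) - f(0.0)), i) for i, f in enumerate(utilities)]
    heapq.heapify(heap)
    for _ in range(m * units_per_server):
        if not heap:          # every thread already holds C
            break
        _, i = heapq.heappop(heap)
        alloc_units[i] += 1
        k = alloc_units[i]
        if k < units_per_server:
            f = utilities[i]
            gain = f((k + 1) * step) - f(k * step)
            heapq.heappush(heap, (-gain, i))
    return [k * step for k in alloc_units]


if __name__ == "__main__":
    f1 = lambda x: min(2.0 * x, 1.0)   # illustrative concave utilities
    f2 = lambda x: x
    print(super_optimal_allocation([f1, f1, f2], m=2, C=1.0, step=0.5))
```

Because units with zero marginal gain are still handed out, the returned allocation uses all mC resources whenever $n \ge m$, matching Lemma V.3.
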
In the remainder of this section, fix A to be an
AA problem consisting of m servers with C resources
each and n threads with utility functions $f_1, \dots, f_n$.
Let $[\hat{c}_1, \dots, \hat{c}_n]$ be a super-optimal allocation for A
computed as in [?]. We define the linearized version
of A to be another AA problem B with the same set of
servers and threads, but where the threads have piecewise
linear utility functions $g_1, \dots, g_n$ defined by

$$g_i(x) = \begin{cases} f_i(\hat{c}_i) \frac{x}{\hat{c}_i} & \text{if } x < \hat{c}_i \\ f_i(\hat{c}_i) & \text{otherwise} \end{cases} \qquad (1)$$

Lemma V.4. For any $i \in T$ and $x \in [0, C]$, $f_i(x) \ge g_i(x)$.

Proof: For $x \in [0, \hat{c}_i]$, we have $f_i(x) \ge \frac{\hat{c}_i - x}{\hat{c}_i} f_i(0) +
\frac{x}{\hat{c}_i} f_i(\hat{c}_i) \ge g_i(x)$, where the first inequality follows
because $f_i$ is concave, and the second inequality follows
because $f_i(0) \ge 0$. Also, for $x > \hat{c}_i$, $f_i(x) \ge f_i(\hat{c}_i) =
g_i(x)$. ■

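The linearization in Equation (1) is straightforward to express in code. The following is a minimal sketch, assuming a utility is given as a Python callable; the helper name `linearize` is hypothetical. The usage example spot-checks the inequality of Lemma V.4 on a grid for one concave utility.

```python
import math


def linearize(f, c_hat):
    """Return g with g(x) = f(c_hat) * x / c_hat for x < c_hat, else f(c_hat)."""
    top = f(c_hat)

    def g(x):
        if x < c_hat:
            return top * x / c_hat
        return top

    return g


if __name__ == "__main__":
    f = math.sqrt            # illustrative concave utility
    g = linearize(f, 4.0)
    grid = [k * 0.25 for k in range(0, 41)]   # points in [0, 10]
    print(all(f(x) >= g(x) - 1e-12 for x in grid))
```
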
B. Approximation algorithm for linearized problem 

We now describe an $\alpha$ approximation algorithm for
the linearized problem. The pseudocode is given in Algorithm 1.
The algorithm takes as input a super-optimal
allocation $[\hat{c}_1, \dots, \hat{c}_n]$ for A and the resulting linearized
utility functions $g_1, \dots, g_n$, as described in Section V-A.

Variable $C_j$ represents the amount of resource left on
server j, and R is the set of unassigned threads. The
outer loop of the algorithm runs until all threads in R
have been assigned. During each loop, U is the set of
(thread, server) pairs such that the server has at least as
much remaining resource as the thread’s super-optimal
allocation. If any such pairs exist, then in line 6 we find
a thread in U with the greatest utility using its super-optimal
allocation, breaking ties arbitrarily. Otherwise, in
line 9 we find a thread that can obtain the greatest utility
using the remaining resources of any server. In both
cases we assign the thread in line 12 to a server giving
it the greatest utility. Lastly, we update the server’s
remaining resources accordingly.


C. Analyzing the linearized algorithm 

We now analyze the quality of the assignment produced
by Algorithm 1. We first define some notation.
Let $D = \{i \in T \mid c_i = \hat{c}_i\}$ be the set of threads
whose allocation in Algorithm 1 equals its super-optimal
allocation, and let $E = T - D$ be the remaining threads.



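Algorithm 1
Input: Super-optimal allocation $[\hat{c}_1, \dots, \hat{c}_n]$, and
$g_1, \dots, g_n$ as defined in Equation (1)
 1: $C_j \leftarrow C$ for $j = 1, \dots, m$
 2: $R \leftarrow \{1, \dots, n\}$
 3: while $R \ne \emptyset$ do
 4:   $U \leftarrow \{(i, j) \mid (i \in R) \wedge (1 \le j \le m) \wedge (\hat{c}_i \le C_j)\}$
 5:   if $U \ne \emptyset$ then
 6:     $(i, j) \leftarrow \arg\max_{(i, j) \in U} g_i(\hat{c}_i)$
 7:     $c_i \leftarrow \hat{c}_i$
 8:   else
 9:     $(i, j) \leftarrow \arg\max_{i \in R,\, 1 \le j \le m} g_i(C_j)$
10:     $c_i \leftarrow C_j$
11:   end if
12:   $r_i \leftarrow j$
13:   $R \leftarrow R - \{i\}$
14:   $C_j \leftarrow C_j - c_i$
15: end while
16: return $(r_1, c_1), \dots, (r_n, c_n)$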


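A direct transcription of Algorithm 1 into Python may help when experimenting with small instances. This is a sketch under stated assumptions: threads and servers are indexed from 0, the linearized utilities are passed as a list of callables `g`, and ties are broken by taking the first maximal pair in index order (the paper allows arbitrary tie-breaking). The function name is hypothetical.

```python
def algorithm1(c_hat, g, m, C):
    """Greedy assignment for the linearized problem (sketch of Algorithm 1).

    c_hat: super-optimal allocations, one per thread.
    g:     linearized utility functions (callables), one per thread.
    Returns a list of (server, allocation) pairs indexed by thread.
    """
    remaining = [C] * m
    unassigned = set(range(len(c_hat)))
    result = [None] * len(c_hat)
    while unassigned:
        # Pairs whose server can still hold the thread's super-optimal share.
        fits = [(i, j) for i in sorted(unassigned) for j in range(m)
                if c_hat[i] <= remaining[j]]
        if fits:
            i, j = max(fits, key=lambda p: g[p[0]](c_hat[p[0]]))
            c = c_hat[i]
        else:
            # No thread fits fully: take the best (thread, server) pair using
            # all of that server's remaining resources.
            i, j = max(((i, j) for i in sorted(unassigned) for j in range(m)),
                       key=lambda p: g[p[0]](remaining[p[1]]))
            c = remaining[j]
        result[i] = (j, c)
        unassigned.remove(i)
        remaining[j] -= c
    return result


if __name__ == "__main__":
    g = [lambda x: min(2.0 * x, 1.0), lambda x: min(2.0 * x, 1.0), lambda x: x]
    print(algorithm1([0.5, 0.5, 1.0], g, m=2, C=1.0))
```

Recomputing the candidate set in every iteration mirrors the pseudocode and is the source of the $O(mn^2)$ term in the running time.
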
We say the threads in D are full, and the threads in E 
are unfull. Note that full threads are the ones computed 
in line 6, and unfull threads are computed in line 9. The 
full threads have the same utility in the super-optimal 
allocation and the allocation produced by Algorithm 1. 
Thus, to show Algorithm 1 achieves a good approximation
ratio it suffices to show the utilities of the unfull
threads in Algorithm 1 are sufficiently large compared 
to their utilities in the super-optimal allocation. We first 
show some basic properties about the unfull threads. 

Lemma V.5. At most one thread from E is assigned to
any server.

Proof: Suppose for contradiction there are two
threads $t_a, t_b \in E$ assigned to a server $s_k$, and assume
that $t_a$ was assigned before $t_b$. Consider the time of $t_b$’s
assignment, and let $S_j$ denote the set of threads assigned
to a server $s_j$. We have $\sum_{i \in S_k} c_i = C$, because $t_a \in E$,
and so it was allocated all of $s_k$’s remaining resources
in lines 10 and 14 of Algorithm 1. Also, $\sum_{i \in S_j} c_i = C$
for any $j \ne k$. Indeed, if $\sum_{i \in S_j} c_i < C$ for any $j \ne k$,
then $s_j$ has more remaining resources than $s_k$, and so
$t_b$ would be assigned to $s_j$ instead of $s_k$ because it can
obtain more utility. Thus, together we have that when
$t_b$ is assigned, $\sum_{j=1}^{m} \sum_{i \in S_j} c_i = mC$. Now,
$c_i \le \hat{c}_i$ for all $t_i \in T$. Also, since $t_a, t_b \in E$, then
$c_a < \hat{c}_a$ and $c_b < \hat{c}_b$. Thus, we have
$\sum_{i \in T} \hat{c}_i > \sum_{j=1}^{m} \sum_{i \in S_j} c_i = mC$, which is a contradiction because
$\sum_{i \in T} \hat{c}_i = mC$ by Lemma V.3. ■

Lemma V.6. $|E| \le m - 1$.

Proof: Lemma V.5 implies that $|E| \le m$, so it
suffices to show $|E| \ne m$. Assume for contradiction
$|E| = m$. Then by Lemma V.5 for each server $s_k$ there
exists a thread $t_a \in E$ assigned to $s_k$. $t_a$ receives all
of $s_k$’s remaining resources, and so $\sum_{i \in S_k} c_i = C$ after
its assignment. Then after all m threads in E have been
assigned, we have $\sum_{i \in T} c_i = mC$. But since $c_a < \hat{c}_a$
for all $t_a \in E$, and $c_i \le \hat{c}_i$ for all $t_i \in T$, we have
$\sum_{i \in T} \hat{c}_i > \sum_{i \in T} c_i = mC$, which is a contradiction.
Thus, $|E| \ne m$ and the lemma follows. ■

The next lemma shows that the total resources allocated
to the unfull threads in Algorithm 1 is not too
small compared to their super-optimal allocation.

Lemma V.7. $\sum_{i \in E} c_i \ge \frac{|E|}{m} \sum_{i \in E} \hat{c}_i$.

Proof: We first partition the servers into sets U and
V, where $U = \{j \in S \mid S_j \subseteq D\}$ is the set of servers
containing only full threads, and $V = S - U$ are the
servers containing some unfull threads. Let $C_j = C -
\sum_{i \in S_j} c_i$ be the amount of unused resources on a server
$s_j$ at the end of Algorithm 1. Then $C_j = 0$ for all $j \in
V$, since the unfull thread in $S_j$ was allocated all the
remaining resources on $s_j$. So, we have $\sum_{j \in U} C_j =
\sum_{j \in S} C_j = \sum_{j \in S} (C - \sum_{i \in S_j} c_i) = mC - \sum_{i \in T} c_i$,
and so $\sum_{i \in T} c_i = mC - \sum_{j \in U} C_j$.
Next, we have $\sum_{i \in T} c_i = \sum_{i \in D} \hat{c}_i + \sum_{i \in E} c_i =
mC - \sum_{i \in E} \hat{c}_i + \sum_{i \in E} c_i$. The first equality follows
because $c_i = \hat{c}_i$ for $i \in D$, and the second equality
follows because $D \cup E = T$ and $\sum_{i \in T} \hat{c}_i = mC$.
Combining this with the earlier expression for $\sum_{i \in T} c_i$,
we have $mC - \sum_{j \in U} C_j = mC - \sum_{i \in E} \hat{c}_i + \sum_{i \in E} c_i$,
and so

$$\sum_{i \in E} c_i + \sum_{j \in U} C_j = \sum_{i \in E} \hat{c}_i. \qquad (2)$$

Now, assume for contradiction that $\sum_{i \in E} c_i <
\frac{|E|}{m} \sum_{i \in E} \hat{c}_i$. Then by Equation (2), we have

$$\sum_{j \in U} C_j > \frac{m - |E|}{m} \sum_{i \in E} \hat{c}_i. \qquad (3)$$

We have $|V| = |E|$, since by Lemma V.5 each server in
V contains only one unfull thread. Thus $|U| = m - |V| =
m - |E|$. Using this in Equation (3), we have that there
exists a $j \in U$ with

$$C_j \ge \frac{1}{|U|} \sum_{j' \in U} C_{j'} > \frac{1}{m} \sum_{i \in E} \hat{c}_i. \qquad (4)$$

We claim that for all $i \in E$, $j \in U$, $c_i \ge C_j$. Indeed,
suppose $c_i < C_j$ for some i. But since $C_j > c_i$, $t_i$ should
be allocated to $s_j$ because it can obtain greater utility on
$s_j$ than its current server, which is a contradiction. Thus,
$c_i \ge C_j$ for all $i \in E$. Using this and Equation (4), we
have

$$\sum_{i \in E} c_i \ge \sum_{i \in E} C_j = |E| C_j > \frac{|E|}{m} \sum_{i \in E} \hat{c}_i.$$

However, this contradicts the assumption that $\sum_{i \in E} c_i <
\frac{|E|}{m} \sum_{i \in E} \hat{c}_i$. Thus, the lemma follows. ■

Let $\gamma = \max_{i \in E} g_i(\hat{c}_i)$ be the maximum super-optimal
utility of any thread in E. The following lemma
says that all of the first m threads assigned by Algorithm
1 are given their super-optimal allocations and have
utility at least $\gamma$.

Lemma V.8. Let $t_i$ be one of the first m threads assigned
by Algorithm 1. Then $i \in D$ and $g_i(c_i) \ge \gamma$.

Proof: To show $i \in D$, note that the m servers all
had C resource at the start of Algorithm 1, and fewer
than m threads were assigned before $t_i$. So when $t_i$ was
assigned, there was at least one server with C resource.
Then $t_i$ can obtain $\hat{c}_i$ resource on one of these servers,
and so $i \in D$.

To show $g_i(c_i) \ge \gamma$, suppose the opposite, and let
$j \in E$ be such that $g_j(\hat{c}_j) = \gamma$. Since $\hat{c}_j \le C$, and since
in $t_i$’s iteration there was some server with C resource,
then in that iteration Algorithm 1 would have obtained
greater utility by assigning $t_j$ instead of $t_i$, which is a
contradiction. Thus, $g_i(c_i) \ge \gamma$. ■

Lemma V.8 implies there are at least m threads in D,
and so we have the following.

Corollary V.9. $\sum_{i \in D} g_i(c_i) \ge m\gamma$.

The next lemma shows that for the threads in E, 
threads with higher slopes in the nonconstant portion 
of their utility functions are allocated more resources. 

Lemma V.10. For any two threads $i, j \in E$, if $\frac{g_i(\hat{c}_i)}{\hat{c}_i} >
\frac{g_j(\hat{c}_j)}{\hat{c}_j}$, then $c_i \ge c_j$.

Proof: Suppose for contradiction $c_i < c_j$, and
suppose first that $t_i$ was assigned before $t_j$. Then when
$t_i$ was assigned, there was at least one server with $c_j$
or more remaining resources. We have $\hat{c}_i > c_j$, since
otherwise $t_i$ can be allocated $\hat{c}_i$ resources, so that $i \notin E$.
Now, since $\hat{c}_i > c_j > c_i$, then $t_i$ could obtain greater
utility by being allocated $c_j$ instead of $c_i$ amount of
resources. This is a contradiction.

Next, suppose $t_j$ was assigned before $t_i$. Then when $t_j$
was assigned, there was a server with at least $c_j$ amount
of resources. Again, we have $\hat{c}_i > c_j$. Indeed, otherwise
we have $\hat{c}_i \le c_j$, and $\hat{c}_j > c_j$ since $j \in E$, and so
$t_i$ can be allocated its super-optimal allocation while
$t_j$ cannot. But Algorithm 1 prefers in line 4 to assign
threads that can receive their super-optimal allocations,
and so it would assign $t_i$ before $t_j$, a contradiction. Thus,
$\hat{c}_i > c_j$. However, this means that in the iteration in
which $t_j$ was assigned, $t_i$ can obtain greater utility than
$t_j$, since $g_i(c_j) = c_j \frac{g_i(\hat{c}_i)}{\hat{c}_i} > g_j(c_j) = c_j \frac{g_j(\hat{c}_j)}{\hat{c}_j}$, where
the first equality follows because $\hat{c}_i > c_j$, the inequality
follows because $\frac{g_i(\hat{c}_i)}{\hat{c}_i} > \frac{g_j(\hat{c}_j)}{\hat{c}_j}$, and the second equality
follows because $\hat{c}_j > c_j$. Thus, $t_i$ would be assigned
before $t_j$, a contradiction. The lemma thus follows. ■
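The following facts are used in later parts of the
proof. The first fact follows from the Cauchy-Schwarz
inequality, and the third is Chebyshev’s sum inequality.

Fact V.11. Given $a_1, \dots, a_n > 0$, we have
$\left(\sum_{i=1}^{n} a_i\right) \left(\sum_{i=1}^{n} \frac{1}{a_i}\right) \ge n^2$.

Fact V.12. Given $a, a', b, b' > 0$, if $\frac{a}{a'} \le \frac{b}{b'}$, then $\frac{a}{a'} \le
\frac{a + b}{a' + b'} \le \frac{b}{b'}$.

Fact V.13. Given $a_1 \ge a_2 \ge \dots \ge a_n$ and $b_1 \ge b_2 \ge
\dots \ge b_n$, we have $\sum_{i=1}^{n} a_i b_i \ge \frac{1}{n} \left(\sum_{i=1}^{n} a_i\right) \left(\sum_{i=1}^{n} b_i\right)$.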

We now state a lower bound on a certain function. 


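Lemma V.14. Let $A, d > 0$, and $0 \le a_1 \le a_2 \le \dots \le a_n$.
Also, let $\beta = (A + \sum_{i=1}^{n} a_i z_i) / (A + \sum_{i=1}^{n} z_i)$, where
each $z_i \in [0, d]$. Then

$$\beta \ge \min\left( \min_{j = 1, \dots, n} \frac{A + \sum_{i=1}^{j} a_i d}{A + jd},\ 1 \right).$$

Proof: If $a_1 \ge 1$, then Fact V.12 implies that $\beta \ge 1$,
and the lemma holds. Otherwise, suppose $a_1 < 1$. Then
differentiating $\beta$ with respect to $z_1$, we get

$$\beta'(z_1) = \frac{(a_1 - 1)A + \sum_{i=2}^{n} (a_1 - a_i) z_i}{\left(A + \sum_{i=1}^{n} z_i\right)^2}.$$

Since $a_1 \le a_2 \le \dots \le a_n$ and $a_1 < 1$, we have $\beta'(z_1) < 0$.
Thus, $\beta(z_1)$ is minimized for $z_1 = d$, and we have

$$\beta \ge \frac{A + a_1 d + \sum_{i=2}^{n} a_i z_i}{A + d + \sum_{i=2}^{n} z_i}.$$

To simplify this expression, suppose first that $(A +
a_1 d)/(A + d) \le a_2$. Then we have

$$\beta \ge \frac{A + a_1 d + \sum_{i=2}^{n} a_i z_i}{A + d + \sum_{i=2}^{n} z_i} \ge \frac{A + a_1 d}{A + d}.$$

The second inequality follows because $a_2 \le \dots \le a_n$
and by Fact V.12. Thus, the lemma is proved. Otherwise,
$(A + a_1 d)/(A + d) > a_2$, and so

$$\beta \ge \frac{A + a_1 d + \sum_{i=2}^{n} a_i z_i}{A + d + \sum_{i=2}^{n} z_i} \ge \frac{A + a_1 d + a_2 d + \sum_{i=3}^{n} a_i z_i}{A + 2d + \sum_{i=3}^{n} z_i}.$$

We can simplify the latter expression in a way similar to
above, based on whether $(A + a_1 d + a_2 d)/(A + 2d) \le a_3$.
Continuing this way, if we stop at the j’th step, then
$\beta \ge (A + \sum_{i=1}^{j} a_i d)/(A + jd)$. Otherwise, after the
n’th step, we have $\beta \ge (A + \sum_{i=1}^{n} a_i d)/(A + nd)$. In
either case, the lemma holds. ■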

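Lemma V.14 can be sanity-checked numerically on random inputs. The sketch below assumes the bound has the form $\beta \ge \min\big(\min_j (A + d\sum_{i \le j} a_i)/(A + jd),\ 1\big)$, with the $a_i$ sorted in nondecreasing order and each $z_i$ drawn from $[0, d]$. The function names, sampling ranges and trial count are hypothetical, and a passing check is evidence, not a proof.

```python
import random


def lemma_bound(A, d, a):
    """min over j of (A + d * sum(a[:j])) / (A + j * d), and 1."""
    best = 1.0
    prefix = 0.0
    for j, aj in enumerate(a, start=1):
        prefix += aj
        best = min(best, (A + d * prefix) / (A + j * d))
    return best


def beta(A, a, z):
    """The ratio (A + sum a_i z_i) / (A + sum z_i)."""
    return (A + sum(ai * zi for ai, zi in zip(a, z))) / (A + sum(z))


def check(trials=1000, seed=0):
    """Return True if no sampled instance violates the bound."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 6)
        A, d = rng.uniform(0.1, 5.0), rng.uniform(0.1, 5.0)
        a = sorted(rng.uniform(0.0, 2.0) for _ in range(n))
        z = [rng.uniform(0.0, d) for _ in range(n)]
        if beta(A, a, z) < lemma_bound(A, d, a) - 1e-9:
            return False
    return True


if __name__ == "__main__":
    print(check())
```
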
Algorithm 1 produces an allocation $c_1, \dots, c_n$ with
total utility $G = \sum_{i \in D} g_i(c_i) + \sum_{i \in E} g_i(c_i)$. We now
prove that this allocation is an $\alpha$ approximation to the
super-optimal utility $\hat{F} = \sum_{i \in T} f_i(\hat{c}_i)$.

Lemma V.15. $G \ge \alpha \hat{F}$, where $\alpha = 2(\sqrt{2} - 1) > 0.828$.

Proof: We have $\hat{F} = \sum_{i \in T} f_i(\hat{c}_i) = \sum_{i \in T} g_i(\hat{c}_i)$
by the definition of the $g_i$’s. Thus,

$$\frac{G}{\hat{F}} = \frac{\sum_{i \in D} g_i(c_i) + \sum_{i \in E} g_i(c_i)}{\sum_{i \in D} g_i(\hat{c}_i) + \sum_{i \in E} g_i(\hat{c}_i)}$$
$$\ge \frac{m\gamma + \sum_{i \in E} c_i \frac{g_i(\hat{c}_i)}{\hat{c}_i}}{m\gamma + \sum_{i \in E} g_i(\hat{c}_i)}$$
$$\ge \frac{m\gamma + \left(\sum_{i \in E} c_i / |E|\right) \sum_{i \in E} \frac{g_i(\hat{c}_i)}{\hat{c}_i}}{m\gamma + \sum_{i \in E} g_i(\hat{c}_i)}$$
$$\ge \frac{m\gamma + \left(\sum_{j \in E} \hat{c}_j / m\right) \sum_{i \in E} \frac{g_i(\hat{c}_i)}{\hat{c}_i}}{m\gamma + \sum_{i \in E} g_i(\hat{c}_i)}.$$

Recall that $\gamma = \max_{i \in E} g_i(\hat{c}_i)$. The first inequality
follows because of Corollary V.9 and because $\hat{c}_i \ge c_i$ for
all i. The second inequality follows because by Lemma
V.10 threads $i \in E$ with larger values of $\frac{g_i(\hat{c}_i)}{\hat{c}_i}$ also have
larger values of $c_i$. Thus, we can apply Fact V.13 to bring
the term $\sum_{i \in E} c_i / |E|$ outside the sum $\sum_{i \in E} c_i \frac{g_i(\hat{c}_i)}{\hat{c}_i}$.
The last inequality follows because of Lemma V.7. Now,
assume WLOG that the elements in E are ordered by
nonincreasing value of $\hat{c}_i$, so that $\frac{1}{\hat{c}_1} \le \frac{1}{\hat{c}_2} \le \dots \le
\frac{1}{\hat{c}_{|E|}}$.
Let $E_i$ denote the first i elements of E in this order.
For any $i \in E$, we have $g_i(\hat{c}_i) \in [0, \gamma]$. Thus, applying
Lemma V.14 to the last expression above with $A = m\gamma$
and $d = \gamma$, letting $g_i(\hat{c}_i)$
play the role of $z_i$, and $\frac{\sum_{j \in E} \hat{c}_j}{m \hat{c}_i}$ play the role of $a_i$, and
noting $\frac{\sum_{j \in E} \hat{c}_j}{m \hat{c}_1} \le 1$, we have

$$\frac{G}{\hat{F}} \ge \min_{i = 1, \dots, |E|} \frac{m\gamma + \left(\sum_{j \in E} \hat{c}_j / m\right) \sum_{j \in E_i} \frac{\gamma}{\hat{c}_j}}{m\gamma + i\gamma}$$
$$\ge \min_{i = 1, \dots, |E|} \frac{m + \frac{1}{m} \left(\sum_{j \in E_i} \hat{c}_j\right) \left(\sum_{j \in E_i} \frac{1}{\hat{c}_j}\right)}{m + i}$$
$$\ge \min_{i = 1, \dots, |E|} \frac{m + i^2 / m}{m + i}.$$

The second inequality follows by simplification and
because $\sum_{j \in E} \hat{c}_j \ge \sum_{j \in E_i} \hat{c}_j$. The last
inequality follows by Fact V.11. It remains to lower
bound the final expression. Treating i as a real value
and taking the derivative with respect to i, we find the
minimum value is obtained at $i = (\sqrt{2} - 1)m$, for which
$\frac{m + i^2/m}{m + i} \ge 2(\sqrt{2} - 1) = \alpha$. Thus, the lemma is proved. ■

D. Solving the concave problem

To solve the original AA problem with concave utility
functions $f_1, \dots, f_n$, we run Algorithm 1 on the linearized
problem to obtain an allocation $c_1, \dots, c_n$, then
simply output this as the solution to the concave problem.
The total utility of this solution is $F = \sum_{i \in T} f_i(c_i)$.
We now show this is an $\alpha$ approximation to the optimal
utility $F^*$.

Theorem V.16. $F \ge \alpha F^*$, and Algorithm 1 achieves an
$\alpha$ approximation ratio.

Proof: We have $F = \sum_{i \in T} f_i(c_i) \ge
\sum_{i \in T} g_i(c_i) \ge \alpha \hat{F} \ge \alpha F^*$, where the first inequality
follows because $f_i(c_i) \ge g_i(c_i)$ by Lemma V.4, the
second inequality follows by Lemma V.15, and the last
inequality follows by Lemma V.2. ■

Next, we give a simple example that shows our
analysis of Algorithm 1 is nearly tight.

Theorem V.17. There exists an instance of AA where
Algorithm 1 achieves $\frac{5}{6} > 0.833$ times the optimal total
utility.

Proof: Consider 3 threads, and 2 servers each with
one (divisible) unit of resource. Let

$$f_1(x) = \begin{cases} 2x & \text{if } x < \frac{1}{2} \\ 1 & \text{if } x \ge \frac{1}{2} \end{cases}$$

Also, let $f_2(x) = x$. Suppose the first two threads both
have utility functions $f_1$, and the third thread has utility
function $f_2$. The super-optimal allocation is $[\hat{c}_1, \hat{c}_2, \hat{c}_3] =
[\frac{1}{2}, \frac{1}{2}, 1]$. Algorithm 1 may assign threads 1 and 2 to
different servers, with $\frac{1}{2}$ resource each, then assign
thread 3 to server 1 with $\frac{1}{2}$ resource. This has a total
utility of $\frac{5}{2}$. On the other hand, the optimal assignment
is to put threads 1 and 2 on server 1 and thread 3 on
server 2. This has a utility of 3. ■

Lastly, we analyze Algorithm 1’s time complexity.

Theorem V.18. Algorithm 1 runs in $O(mn^2 +
n(\log mC)^2)$ time.

Proof: Computing the super-optimal allocation
takes $O(n(\log mC)^2)$ time using the algorithm in [?].
Then the algorithm runs n loops, where in each loop
it computes the set U with $O(mn)$ elements. Thus, the
theorem follows. ■

VI. A Faster Algorithm 


In this section, we present a faster approximation algorithm that achieves the same approximation ratio as Algorithm 1 in $O(n(\log mC)^2)$ time. The pseudocode is shown in Algorithm 2.

The algorithm also takes as input a super-optimal allocation $\hat{c}_1, \ldots, \hat{c}_n$, which we compute as in Section V-A. It sorts the threads in nonincreasing order of $g_i(\hat{c}_i)$. It then takes threads $m+1$ to $n$ in this ordering, and re-sorts them in nonincreasing order of $g_i(\hat{c}_i)/\hat{c}_i$. Next, it initializes $C_1, \ldots, C_m$ to $C$, and stores them in a max-heap $H$. $C_j$ represents the amount of remaining resources on server $j$. The main loop of the algorithm iterates through the threads in order. Each time it chooses the server with the most remaining resources, allocates the minimum of the thread's super-optimal allocation and the server's remaining resources to it, and assigns the thread to the server. Then $H$ is updated accordingly.


Algorithm 2

Input: Super-optimal allocation $[\hat{c}_1, \ldots, \hat{c}_n]$, and $g_1, \ldots, g_n$ as defined in Section V

1: sort threads in nonincreasing order of $g_i(\hat{c}_i)$ as $t_1, \ldots, t_n$
2: sort $t_{m+1}, \ldots, t_n$ in nonincreasing order of $g_i(\hat{c}_i)/\hat{c}_i$
3: $C_j \leftarrow C$ for $j = 1, \ldots, m$
4: store $C_1, \ldots, C_m$ in a max-heap $H$
5: for $i = 1, \ldots, n$ do
6:     $j \leftarrow \arg\max_{1 \leq j \leq m} C_j$
7:     $c_i \leftarrow \min(\hat{c}_i, C_j)$
8:     $C_j \leftarrow C_j - c_i$, and update $H$
9:     $r_i \leftarrow j$
10: end for
11: return $(r_1, c_1), \ldots, (r_n, c_n)$
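
The pseudocode above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `algorithm2`, its argument names, and the 0-based thread and server indices are chosen here for illustration, and the super-optimal allocation and the values $g_i(\hat{c}_i)$ are taken as precomputed inputs. Python's `heapq` is a min-heap, so remaining capacities are stored negated to get max-heap behavior; ties between servers are broken by the smaller index, which the pseudocode leaves unspecified.

```python
import heapq


def algorithm2(c_hat, g_hat, m, C):
    """Sketch of Algorithm 2 (names and indexing are illustrative assumptions).

    c_hat[i]: super-optimal allocation of thread i
    g_hat[i]: super-optimal utility g_i(c_hat[i]) of thread i
    m, C:     number of servers and capacity of each server
    Returns a list where entry i is (server, allocation) for thread i.
    """
    n = len(c_hat)

    def density(i):
        # Utility per unit of resource. A thread with zero super-optimal
        # allocation needs no resources, so its position does not matter
        # (assumption: place it first).
        return g_hat[i] / c_hat[i] if c_hat[i] > 0 else float("inf")

    # Line 1: nonincreasing order of super-optimal utility.
    order = sorted(range(n), key=lambda i: g_hat[i], reverse=True)
    # Line 2: re-sort all but the first m threads by utility density.
    order = order[:m] + sorted(order[m:], key=density, reverse=True)

    # Lines 3-4: every server starts with C remaining; negate for a max-heap.
    heap = [(-float(C), j) for j in range(m)]
    heapq.heapify(heap)

    result = [None] * n
    for i in order:
        # Line 6: server with the most remaining resources.
        neg_remaining, j = heapq.heappop(heap)
        # Line 7: give the thread what it wants, or whatever is left.
        c = min(c_hat[i], -neg_remaining)
        # Line 8: update the remaining capacity of server j.
        heapq.heappush(heap, (neg_remaining + c, j))
        # Line 9: record the assignment.
        result[i] = (j, c)
    return result
```

For instance, with two unit-capacity servers and three threads whose super-optimal allocations are 0.5, 0.5 and 1 with equal super-optimal utilities, the sketch spreads the first two threads over both servers and leaves 0.5 units for the third.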


A. Algorithm analysis 

We now show Algorithm 2 achieves a $2(\sqrt{2} - 1)$ approximation ratio, and runs in $O(n(\log mC)^2)$ time. The proof of the approximation ratio uses exactly the same set of lemmas as in Sections V-A and V-B. The proofs for most of the lemmas are also similar. Rather than replicating them, we go through the lemmas and point out any differences in the proofs. Please refer to Sections V-A and V-B for the definitions, lemma statements and original proofs.

• Lemmas V.2, V.3, V.4 These lemmas deal with 
the super-optimal allocation, which is the same in 
Algorithms 1 and 2. 

• Lemma V.5 The proof of this lemma depended on the fact that in Algorithm 1, if we assign a second $E$ thread to a server, then all the other servers have no remaining resources. This is also true in Algorithm 2: line 6 assigns a thread to a server with the most remaining resources, so if another server $s'$ had positive remaining resources when we assigned a second $E$ thread $t$ to a server $s$, we would have assigned $t$ to $s'$ instead, a contradiction.

• Lemma V.6 This follows by exactly the same arguments as the original proof.

• Lemma V.7 The only statement we need to check from the original proof is that for all $i \in E$ we have $c_i \geq C_j$. But this is true in Algorithm 2 because if there were any $c_i < C_j$, line 6 of Algorithm 2 would assign thread $i$ to server $j$ instead of $i$'s current server, a contradiction. All the other statements in the original proof then follow.

• Lemma V.8 This follows because lines 1 and 2 of 
Algorithm 2 show that the first m assigned threads 
have at least as much super-optimal utility as the 
remaining n — m threads. Also, the first m threads 
must be in D, since there is always a server with C 
resources during the first m iterations of Algorithm 
2. Thus, all threads in E are among the last n — m 
assigned threads, and their maximum super-optimal 
utility is no more than the minimum utility of any 
D thread. 

• Corollary V.9 This follows immediately from 
Lemma V.8. 

• Lemma V.10 As we stated above, all threads in $E$ must be among the last $n - m$ assigned by Algorithm 2. That is, they are among threads $t_{m+1}, \ldots, t_n$. In line 2 these threads are sorted in nonincreasing order of $g_i(\hat{c}_i)/\hat{c}_i$. Thus, the lemma follows.

• Facts V.11 to V.13, Lemma V.14 These follow independently of Algorithm 2.

• Lemma V.15 The proof of this lemma used only the 
preceding lemmas, not any properties of Algorithm 
1. Thus, it also holds for Algorithm 2. 


Given the preceding lemmas, we can state the approximation ratio of Algorithm 2. The proof of the theorem is the same as the proof of Theorem V.16 and is omitted.


Theorem VI.1. Let $F$ be the total utility from the assignment produced by Algorithm 2, and let $F^*$ be the optimal total utility. Then $F \geq \alpha F^*$.


Lastly, we analyze Algorithm 2's time complexity.

Theorem VI.2. Algorithm 2 runs in $O(n(\log mC)^2)$ time.


Proof: Finding the super-optimal allocation takes $O(n(\log mC)^2)$ time using the algorithm of Section V-A. Steps 1 and 2 take $O(n \log n)$ time. Since $C$ is usually large in practice, we can assume that $\log n = O((\log mC)^2)$. Each iteration of the main for loop takes $O(\log m)$ time to extract the maximum element from $H$ and update $H$, so the entire for loop takes $O(n \log m)$ time. Hence the overall running time is dominated by the time to find a super-optimal allocation, and the theorem follows. ■

VII. Experimental Evaluation 

In this section we evaluate the performance of our algorithms experimentally. As both Algorithms 1 and 2 have the same approximation ratio, we only evaluate Algorithm 2. We compare the total utility the algorithm achieves with the super-optimal (SO) utility, which is at least as large as the optimal utility. We also compare the algorithm with several simple but practical heuristics we name UU, UR, RU and RR. The UU (uniform-uniform) heuristic assigns threads in a round-robin manner to the servers, and allocates the threads assigned to a server the same amount of resources. UR (uniform-random) assigns threads in a round-robin manner, and allocates threads a random amount of resources on each server. RU (random-uniform) assigns threads to random servers, and equally allocates resources on each server. Finally, RR (random-random) randomly assigns threads and allocates them random amounts of resources.
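
The four baseline heuristics can be sketched as one small Python function. This is only an illustration under assumptions: the name `baseline` and its parameters are hypothetical, and the text does not say how the "random amounts" are drawn, so the sketch assumes random positive weights that are normalized so each server's capacity $C$ is fully shared among its threads.

```python
import random


def baseline(name, n, m, C, seed=0):
    """Sketch of the UU, UR, RU and RR heuristics (illustrative only).

    name: two letters; the first picks the assignment rule and the
          second picks the allocation rule ("U" uniform, "R" random).
    Returns a list where entry i is (server, allocation) for thread i.
    """
    if name not in ("UU", "UR", "RU", "RR"):
        raise ValueError("name must be one of UU, UR, RU, RR")
    rng = random.Random(seed)

    # Assignment: round robin, or an independent random server per thread.
    if name[0] == "U":
        servers = [i % m for i in range(n)]
    else:
        servers = [rng.randrange(m) for _ in range(n)]

    # Allocation weights: equal shares, or random shares in (0, 1].
    # The random-share scheme is an assumption, not taken from the paper.
    if name[1] == "U":
        weights = [1.0] * n
    else:
        weights = [1.0 - rng.random() for _ in range(n)]

    # Normalize the weights on each server so they sum to its capacity C.
    totals = [0.0] * m
    for s, w in zip(servers, weights):
        totals[s] += w
    return [(s, C * w / totals[s]) for s, w in zip(servers, weights)]
```

With `baseline("UU", 4, 2, 10.0)`, threads alternate between the two servers and each receives half of its server's capacity.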

Our experiments use threads with random concave utility functions generated according to various probability distributions, as follows. We fix an amount of resources $C$ on each server, and set the value of the utility function at 0 to be 0. Then we generate two values $v$ and $w$ according to the distribution $H$, conditioned on $w \leq v$, and set the value of the utility function at $C/2$ to $v$, and the value at $C$ to $v + w$. Lastly, we apply the PCHIP interpolation function from Matlab to the three generated points to produce a smooth concave utility function. In all the experiments, we set the number of servers to be $m = 8$, and test the effects of varying different parameters. One parameter is $\beta = n/m$, the average number of threads per server. All the experiments run quickly in practice. Using $m = 8$, $n = 100$ and $C = 1000$, an unoptimized Matlab implementation of Algorithm 2 finishes in only 0.02 seconds. The results below show the average performance from 1000 random trials.
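
The three-point construction of a random utility function can be illustrated with a short Python sketch. Note the substitution: the experiments use Matlab's PCHIP interpolation, whereas this sketch uses plain piecewise-linear interpolation through the same three points, which is also concave whenever $w \leq v$ but is not smooth. The names `random_concave_utility` and `draw` are hypothetical, and the midpoint $C/2$ is assumed as the location of the middle point.

```python
import random


def random_concave_utility(C, draw, rng):
    """Build a concave utility through (0, 0), (C/2, v), (C, v + w).

    draw(rng) returns one sample from the chosen distribution.
    Piecewise-linear interpolation stands in for PCHIP (an assumption).
    """
    # Sample v and w, conditioned on w <= v, by rejection.
    while True:
        v, w = draw(rng), draw(rng)
        if w <= v:
            break
    half = C / 2.0

    def utility(x):
        x = min(max(x, 0.0), C)      # clamp to the feasible range [0, C]
        if x <= half:
            return v * x / half      # slope 2v/C on the first half
        return v + w * (x - half) / half  # slope 2w/C <= 2v/C afterwards

    return utility


# Example: values drawn uniformly from [0, 1), server capacity C = 1000.
rng = random.Random(1)
f = random_concave_utility(1000.0, lambda r: r.random(), rng)
```

Because the slope on the second half never exceeds the slope on the first half, the resulting function is concave and nondecreasing for nonnegative samples.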

A. Uniform and normal distributions 

We first consider Algorithm 2's performance compared to SO, UU, UR, RU and RR on threads with utility functions generated according to the uniform and normal distributions. We set the mean and standard deviation of the normal distribution to be 1 and 1, respectively. Figures 1(a) and 1(b) show the ratio of Algorithm 2's total utility versus the utilities of the other algorithms, for $\beta$ varying from 1 to 15. The behaviors for both distributions are similar. Compared to SO, our performance never drops below 0.99, meaning that Algorithm 2 always achieves at least 99% of the optimal utility. The ratios of our total utility compared to those of UU, UR, RU and RR are always above 1, so we always perform better than the simple heuristics. For small values of $\beta$, UU performs well. Indeed, for $\beta = 1$, UU achieves the optimal utility because it places one thread on each server and allocates it all the resources. UR does not achieve optimal utility even for $\beta = 1$, since it allocates threads random amounts of resources. RU and RR may assign multiple threads to one server, and also do not achieve the optimal utility.

As $\beta$ grows, the performance of the heuristics gets worse relative to Algorithm 2. This is because as the number of threads grows, it becomes more likely that some threads have very high maximum utility. These threads need to be assigned and allocated carefully. For example, they should be assigned to different servers and allocated as many resources as possible. The heuristics likely fail to do this, and hence obtain low performance. The performance of UR and RR, as well as that of UU and RU, converges as $\beta$ grows. This is because both random and uniform assignments assign the threads roughly evenly between the servers for large $\beta$. Also, the performance of UU and RU is substantially better than that of UR and RR, which indicates that the way in which resources are allocated has a bigger effect on performance than how threads are assigned, and that uniform allocation is generally better than random allocation.

Fig. 1. Performance of Algorithm 2 versus SO, UU, RU, UR, and RR as a function of $\beta$ under the uniform and normal distributions.

B. Power law distribution 

We now look at the performance of Algorithm 2 using threads generated according to the power law distribution. Here, each value $x$ has a probability $A x^{-\alpha}$ of occurring, for some $\alpha > 1$ and normalization factor $A$. Figure 2(a) shows the effect of varying $\beta$ while fixing $\alpha = 2$. Here we see the same trends as those under the uniform and normal distributions, namely that Algorithm 2 always performs very close to optimal, while the performance of the heuristics gets worse with increasing $\beta$. However, the rate of performance degradation is faster than with the uniform and normal distributions. This is because the power law distribution with $\alpha = 2$ is more likely to generate threads with very different maximum utilities. These threads must be carefully assigned and allocated, which the heuristics fail to do. For $\beta = 15$, Algorithm 2 is 3.9 times better than UU and RU, and 5.7 times better than UR and RR.

Fig. 2. Performance of Algorithm 2 versus SO, UU, RU, UR, and RR as a function of $\beta$ and $\alpha$ under the power law distribution.
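
Power law values of the kind described in this subsection could be drawn as in the sketch below. Several things are assumed here because the text does not specify them: the values are treated as continuous on $[x_{\min}, \infty)$ with $x_{\min} = 1$, and the function name `power_law_value` is hypothetical. A density proportional to $x^{-\alpha}$ on that range is a Pareto distribution with shape $\alpha - 1$, which Python's standard library provides as `random.paretovariate`.

```python
import random


def power_law_value(alpha, rng, x_min=1.0):
    """Draw one value with density proportional to x**(-alpha), x >= x_min.

    Assumes a continuous distribution and alpha > 1; x_min is an
    illustrative choice not given in the paper.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    # paretovariate(k) has density k * x**(-(k + 1)) on [1, inf),
    # so shape k = alpha - 1 gives the exponent -alpha.
    return x_min * rng.paretovariate(alpha - 1.0)


rng = random.Random(0)
samples = [power_law_value(2.0, rng) for _ in range(5)]
```

Smaller values of $\alpha$ give a heavier tail, so a few samples are much larger than the rest, which matches the observation that such instances contain threads with very different maximum utilities.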

Figure 2(b) shows the effect of varying $\alpha$, using a fixed $\beta = 5$. Algorithm 2's performance is nearly optimal. In addition, the performance of the heuristics improves as $\alpha$ increases. This is because for higher values of $\alpha$, it is unlikely that there are threads with very high maximum utilities. So, since the maximum utilities of the threads are roughly the same, almost any even assignment of the threads works well. Despite this, we still observe that UU and RU perform better than UR and RR. This is because when the threads are roughly the same, then due to the concavity of the utility functions, the optimal allocation is to give each thread the same amount of resources. This is done by UU and RU, but not by UR and RR, which allocate resources randomly.

C. Discrete distribution 

Lastly, we look at the performance using utility functions generated by a discrete distribution. This distribution takes on only two values $\ell, h$, with $\ell < h$. $\gamma$ is a parameter that controls the probability that $\ell$ occurs, and $\theta = h/\ell$ is a parameter that controls the relative size of the values. Figure 3(a) shows Algorithm 2's performance as we vary $\beta$, fixing $\gamma = 0.85$ and $\theta = 5$. The same trends as with the other distributions are observed. Figure 3(b) shows the effect of varying $\gamma$, when $\beta = 5$ and $\theta = 5$. Our algorithm achieves the lowest performance for $\gamma = 0.75$, when we achieve 97.5% of the super-optimal utility. The four heuristics also perform worst for this value. For $\gamma$ close to 0 or 1, all the heuristics perform well, since these correspond to instances where either $h$ or $\ell$ is very likely to occur, so that almost all the threads have the same maximum utility. Lastly, we consider the effect of varying $\theta$. Here, as $\theta$ increases, the difference between the high and low utilities becomes more evident, and the effects of poor thread assignments or misallocating resources become more serious. Hence, the performance of the heuristics decreases with $\theta$. Meanwhile, Algorithm 2 always achieves over 99% of the optimal utility.


VIII. Conclusion 

In this paper, we studied the problem of simultaneously assigning threads to servers and allocating server resources to maximize total utility. Each thread was modeled by a concave utility function. We showed that the problem is NP-hard, even when there are only two servers. We also presented two algorithms with approximation ratio $2(\sqrt{2} - 1) > 0.828$, running in times $O(mn^2 + n(\log mC)^2)$ and $O(n(\log mC)^2)$, respectively. We tested our algorithms on multiple types of threads, and found that we achieve over 99% of the optimal utility on average. We also perform up to 5.7 times better than several heuristic methods.

In our model we considered homogeneous servers with the same amount of resources and also a single resource type. In the future, we would like to extend our algorithm to accommodate heterogeneous servers with different capacities and multiple types of resources. Also, in practice the utility functions of threads may change over time. Thus, we would like to integrate online performance measurements into our algorithms to produce dynamically optimal assignments. Finally, we are interested in applying our methods in real-world systems such as cloud computers and datacenters.